Red wine is a type of wine made from dark-colored grape varieties. This is an analysis of a dataset containing the quality, alcohol content and other attributes of almost 1600 red wine samples, with the aim of drawing some useful insights about red wines.
1 - fixed acidity: most acids involved with wine are fixed or nonvolatile (do not evaporate readily). unit: g / dm^3.
2 - volatile acidity: the amount of acetic acid in wine, which at too high levels can lead to an unpleasant, vinegar taste. unit: g / dm^3.
3 - citric acid: found in small quantities, citric acid can add ‘freshness’ and flavor to wines. unit: g / dm^3.
4 - residual sugar: the amount of sugar remaining after fermentation stops; it’s rare to find wines with less than 1 gram/liter, and wines with greater than 45 grams/liter are considered sweet. unit: g / dm^3.
5 - chlorides: the amount of salt in the wine. unit: g / dm^3.
6 - free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine. unit: mg / dm^3.
7 - total sulfur dioxide: amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine. unit: mg / dm^3.
8 - density: the density of wine is close to that of water, depending on the percent alcohol and sugar content. unit: g / cm^3.
9 - pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale
10 - sulphates: a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant. unit: g / dm^3.
11 - alcohol: the percent alcohol content of the wine. unit: % by volume.
Output variable (based on sensory data): 12 - quality (score between 0 and 10)
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
This dataset contains 12 different attributes (plus the row index X) and 1599 red wine records.
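The structure and summary listings above are the kind of output that `str()` and `summary()` produce. As a minimal sketch, the block below rebuilds only the first ten rows of a few columns, copied from the `str()` listing, and calls the same functions. The name `wine_head` is made up for this illustration, and the numbers it prints describe these ten rows only, not the full 1599-row data.

```r
# First ten rows of selected columns, copied from the str() listing above.
# `wine_head` is a hypothetical name; the full data frame has 1599 rows.
wine_head <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  pH               = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Column types and a preview of the values
str(wine_head)

# Min, quartiles, median, mean and max per column
print(summary(wine_head))

# Number of rows and columns in this small excerpt
dims <- dim(wine_head)
```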
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.210 3.310 3.311 3.400 4.010
As we can see, the pH values of the red wines all lie between 2.74 and 4.01. This means that all the red wines are acidic, because liquids with a pH below 7 are acidic and those with a pH above 7 are basic. The plot above shows a roughly normal distribution of pH values. Most red wines have a pH between 3.2 and 3.4. There are some outliers, with a pH as high as 4.01 and as low as 2.74. The median pH value is 3.31.
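One common way to make "outlier" concrete is the 1.5 * IQR rule used for boxplot whiskers. The sketch below applies it to the pH quartiles reported above. Whether the original plot used this convention is an assumption, and `iqr_fences` is a helper name invented for this example.

```r
# Tukey-style fences: values beyond Q1 - k*IQR or Q3 + k*IQR are flagged.
# `iqr_fences` is a hypothetical helper, not part of the original analysis.
iqr_fences <- function(q1, q3, k = 1.5) {
  iqr <- q3 - q1
  c(lower = q1 - k * iqr, upper = q3 + k * iqr)
}

# Quartiles taken from the pH summary above
ph_fences <- iqr_fences(q1 = 3.210, q3 = 3.400)
print(ph_fences)

# The reported minimum and maximum pH values
ph_min <- 2.740
ph_max <- 4.010

# Both extremes fall outside the fences under this rule
min_is_outlier <- ph_min < ph_fences[["lower"]]
max_is_outlier <- ph_max > ph_fences[["upper"]]
```

Under this rule both the minimum and the maximum pH lie outside the fences, which agrees with reading them as outliers.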
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3900 0.5200 0.5278 0.6400 1.5800
The volatile acidity of most of the wines is between 0.3 g/dm^3 and 0.7 g/dm^3. There are some outliers above 1.1 g/dm^3. The maximum volatile acidity in the samples is 1.58 g/dm^3, which is not good at all, because the lower the volatile acidity, the better the quality of the wine is expected to be. The median volatile acidity is 0.52 g/dm^3. So the wine samples with volatile acidity near 0.12 g/dm^3 should have very good quality, and the samples with volatile acidity above 1.0 g/dm^3 should be of low quality. Let's see.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.090 0.260 0.271 0.420 1.000
Citric acid can add freshness and flavor to the wine. From the plot above it is clear that citric acid is present in very small quantities. A large number of the wines contain no citric acid at all (0 g/dm^3), but the median is 0.26 g/dm^3 and many wines have between 0.1 and 0.5 g/dm^3. The outlier with 1 g/dm^3 of citric acid should taste very good and thus should have a very good rating. Let's see if this turns out to be true.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
The lowest alcohol content among the wines is 8.4 % and the highest is 14.9 %. The largest group of wines has an alcohol content between 9.0 % and 9.5 %. Most of the wines have an alcohol content between 9.0 % and 12 %.
Most red wines contain between 1.5 g/dm^3 and 4 g/dm^3 of residual sugar. A very large number of red wines contain about 2 g/dm^3 of residual sugar.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
Very few wines have a chloride (i.e. salt) content below 0.03 g/dm^3 or above 0.15 g/dm^3. A large number of wines have a salt content between 0.075 g/dm^3 and 0.085 g/dm^3. The data set description does not say how salt affects the quality of the wine. So let's see how much salt is contained in the wines with a quality rating above 5.
These are the wines with a quality rating above 5. Some of them have more than 0.18 g/dm^3 of salt, but most wines with a rating above 5 have between 0.03 g/dm^3 and 0.13 g/dm^3.
These are the wines with a quality rating below 5. From the plot above it is not clear how salt affects the quality of wine, because some of these wines have more than 0.13 g/dm^3 of salt but most of them fall between 0.03 g/dm^3 and 0.13 g/dm^3, the same range as the high quality wines. I think there is no relationship between chlorides and the quality of wine, but I am not sure yet. Let's see.
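The comparison above rests on splitting the wines by quality rating and looking at chlorides in each group. A minimal sketch of that kind of subsetting follows, using only the first ten rows from the `str()` listing. The name `toy` is hypothetical, and the cut at `quality <= 5` for the lower group is an assumption made so that the tiny example has rows in both groups; the means it prints say nothing about the full data.

```r
# First ten rows of chlorides and quality, copied from the str() listing.
# `toy` is a hypothetical name used only for this illustration.
toy <- data.frame(
  chlorides = c(0.076, 0.098, 0.092, 0.075, 0.076,
                0.075, 0.069, 0.065, 0.073, 0.071),
  quality   = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Split by quality rating (the threshold for the lower group is an assumption)
high <- subset(toy, quality > 5)
low  <- subset(toy, quality <= 5)

# Compare chloride content between the two groups
mean_high <- mean(high$chlorides)
mean_low  <- mean(low$chlorides)
print(c(high = mean_high, low = mean_low))
```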
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9956 0.9968 0.9967 0.9978 1.0037
The lowest density here is 0.9901 g/cm^3 and the highest is 1.0037 g/cm^3. Most of the red wines have a density in the range of 0.995 to 0.997 g/cm^3.
The simple plot above shows that most of the wines got moderate quality ratings of 5 and 6. Very few have the very high rating of 8 or the low rating of 3. We will explore further to find out what led to those ratings.
There are 1599 wines in the dataset with 12 features (fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol and quality).
The variable quality is an ordered factor variable with the following levels.
(worst) ---> (best)
Quality: 0 ---> 10
Other observations:
- Most wines have moderate ratings of 5 and 6.
- A large number of wines contain no citric acid at all (0 g/dm^3). Citric acid gives taste and freshness to the wine, so this is an interesting fact.
- The largest group of red wines has an alcohol content between 9.0 % and 9.5 %.
- How chlorides (i.e. salts) affect the quality of the wine is not clear yet.
The main feature of this dataset is the quality of the red wine. I would like to find out which attributes lead to good quality in red wine. I think citric acid, in combination with some other variables, can be used to build a model for predicting the quality of the red wine.
Volatile acidity, alcohol content, pH, residual sugar and density will most likely influence the quality of red wine. But I think citric acid, volatile acidity and alcohol content will contribute more than the others to our predictive model.
So far I have not created any new variables.
No. The data is already in tidy format. But I have subsetted the data in some places, for example while looking at the relation between chlorides and quality.
From the subset of data, the only strong correlation with the quality of wine is alcohol, whereas volatile acidity shows a moderate correlation. Fixed acidity, citric acid and sulphates have small correlations with quality. We will explore the other variables by plotting them against quality.
Here, I have converted the quality column to a factor, as quality is a categorical variable. With this modification, I have plotted the boxplot above of red wine quality against alcohol percentage. From this plot we can clearly see the relation between alcohol content and wine quality: the higher the alcohol content, the higher the quality of the red wine. The best red wines, with a quality of 8, have a median alcohol content above 12 %.
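The conversion of quality to a factor and the per-category medians that a boxplot displays can be sketched as follows. This uses only the first ten rows from the `str()` listing, so the medians are illustrative and do not match the full-data boxplot. The names `toy` and `quality_f` are hypothetical, and the levels 0 to 10 follow the score range given in the variable description.

```r
# First ten rows of alcohol and quality, copied from the str() listing.
# `toy` and `quality_f` are hypothetical names for this illustration.
toy <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Treat quality as an ordered categorical variable on the 0..10 scale
toy$quality_f <- factor(toy$quality, levels = 0:10, ordered = TRUE)

# Median alcohol per quality category; droplevels() removes unused scores
median_by_quality <- tapply(toy$alcohol, droplevels(toy$quality_f), median)
print(median_by_quality)
```

With the factor in place, a call such as `boxplot(alcohol ~ quality_f, data = toy)` would draw one box per quality category.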
##
## Calls:
## m1: lm(formula = quality ~ alcohol, data = redWine)
##
## ================================
## (Intercept) 1.875***
## (0.175)
## alcohol 0.361***
## (0.017)
## --------------------------------
## R-squared 0.227
## adj. R-squared 0.226
## sigma 0.710
## F 468.267
## p 0.000
## Log-likelihood -1721.057
## Deviance 805.870
## AIC 3448.114
## BIC 3464.245
## N 1599
## ================================
Based on the R^2 value above, alcohol content explains about 23 % of the variance in the quality of red wine.
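To see what the fitted line implies, the sketch below plugs the reported m1 coefficients into the line quality = intercept + slope * alcohol and evaluates it at the lowest and highest alcohol contents in the data. It also uses the fact that, for a regression with a single predictor, the square root of R^2 equals the absolute correlation between the two variables. The helper `predict_quality` is a name invented for this example, and the inputs are the rounded values printed in the table.

```r
# Rounded coefficients of m1 (quality ~ alcohol) from the table above
b0 <- 1.875  # intercept
b1 <- 0.361  # change in predicted quality per 1 % alcohol

# Hypothetical helper: predicted quality for a given alcohol content (%)
predict_quality <- function(alcohol) b0 + b1 * alcohol

# Predictions at the minimum and maximum alcohol content in the data
pred_low  <- predict_quality(8.4)
pred_high <- predict_quality(14.9)
print(c(low = pred_low, high = pred_high))

# With one predictor, sqrt(R^2) is the absolute Pearson correlation
r_from_r2 <- sqrt(0.227)
print(r_from_r2)
```

Across the whole observed alcohol range the line moves predicted quality by a little over two points, which fits with alcohol alone explaining only part of the variance.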
The plot above clearly indicates that there is a negative correlation between quality and volatile acidity. This means that as we go from low quality to high quality red wines, the volatile acidity decreases. In high quality red wines the volatile acidity is low.
There is a clear difference in the mean citric acid content of the red wines across the quality categories. We can see there is a small positive correlation between quality and the citric acid content of the red wine. So we will use citric acid content when creating our predictive model.
There is no clear difference in the mean chloride content of the red wines across the quality categories. There is some difference between the mean chloride content of wines of quality 3 and 4, but it is not enough, because from quality 4 onwards there is only a slight difference. So we will not take chloride content into account when creating our predictive model.
From the plot above, there is a noticeable positive correlation between quality and the sulphate content of the red wine. So we will consider sulphate content when finalizing our predictive model.
There is a small negative correlation between quality and the density of the red wine, i.e. low quality wines are denser than high quality wines.
As the fixed acidity increases, the density of the red wine also increases. We see that high quality red wines contain slightly higher fixed acidity than low quality red wines.
We see here that there is a strong negative correlation between pH value and fixed acidity. Similarly, there is a strong negative correlation between citric acid and pH value.
The quality of the red wine correlates strongly with alcohol content. As the alcohol content increases, the quality of the red wine also increases. Based on the R^2 value, alcohol content explains about 23 percent of the variance in quality. Other features of interest can be incorporated into the model to explain more of the variance in quality.
I think alcohol content alone can be said to be strongly correlated with the quality of the red wine. Volatile acidity is moderately related to quality, but in a negative direction. This means that as the volatile acidity of the red wine increases, its quality goes down, which is what we expected earlier in the univariate analysis.
There is a small, close to moderate, correlation between quality and the citric acid content of the red wine. This correlation is positive. This means that as the citric acid content of the red wine increases, its quality increases too, as citric acid provides taste and freshness. This was also suspected in the univariate analysis and it turned out to be true.
In the univariate analysis I was looking at how chlorides (i.e. salts) affect the quality of the red wine, but I did not get any useful insight. The scatterplot of chlorides against quality has made it clear that salts have no considerable effect on the quality of the red wine.
Though the main feature I am interested in here is the quality of the red wine, I also looked at the correlation between pH and citric acid and between pH and fixed acidity. In the correlation plot from ggpairs, I saw a clear negative correlation between these variables. At first this confused me, because I was thinking that as the citric acid content and fixed acidity of the wine increase, the overall acidity, i.e. the pH value, should increase. But here the pH value was decreasing. When I plotted these variables against each other, I noticed my mistake: I had assumed that a higher pH value means higher acidity, but it is exactly the opposite. Below 7, the lower the pH value, the more acidic the liquid. So an increase in acidity should show up as a decrease in pH, and that is what was actually happening in the plot.
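The direction of the pH scale follows from its definition, pH = -log10 of the hydrogen ion concentration (in mol/L), so concentration = 10^(-pH). A small sketch of that relation is below; `h_conc` is a helper name made up for this example.

```r
# Hydrogen ion concentration (mol/L) implied by a pH value.
# `h_conc` is a hypothetical helper based on pH = -log10([H+]).
h_conc <- function(pH) 10^(-pH)

# One pH unit lower means ten times more hydrogen ions, i.e. more acidic
ratio_one_unit <- h_conc(3.0) / h_conc(4.0)
print(ratio_one_unit)

# Most acidic wine in the data (pH 2.74) versus the least acidic (pH 4.01)
ratio_extremes <- h_conc(2.74) / h_conc(4.01)
print(ratio_extremes)
```

This is why adding acids such as citric acid pushes the pH value down rather than up.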
I found that alcohol content has the strongest correlation with the quality of the red wine, while volatile acidity is moderately correlated with quality.
We see clearly here that the high quality red wines are the ones with a ratio of alcohol to volatile acidity above 20. An increase in citric acid also goes with an increase in the quality of the red wine. So high quality in red wine mainly depends on high alcohol and citric acid content and low volatile acidity.
We now know that low volatile acidity goes with high quality, and this is shown in this plot, as most of the good quality red wines contain less than 0.7 g/dm^3. The pH value of most of the red wines lies between 3.0 and 3.6. There are some high quality as well as low quality wines at both very low and very high pH values. From this we can say that the acidity of the wine does not affect its quality linearly, but most wines have a pH between 3.0 and 3.6.
## [1] 0.1099032
So far we know that citric acid and alcohol content affect the quality of the red wine, but we did not know how these two variables are related to each other. From this plot we can say that there is a very small correlation between the citric acid and alcohol content of the red wine. To be precise, r = 0.110.
## [1] 0.2513971
## [1] 0.09359475
## [1] 0.4761663
The sulphate content of the red wine affects its quality, with a correlation of 0.25 between them. As alcohol content is the variable with the strongest correlation with quality (0.476), I want to check which other variables affect the alcohol content of the wine. From the plot above it is clear that sulphates and alcohol content have a very small correlation. Precisely, r = 0.0936, which is very weak.
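The size of a correlation and its statistical significance are separate questions, and with 1599 wines even a weak correlation can be statistically distinguishable from zero. The sketch below applies the usual t-test for a Pearson correlation, t = r * sqrt(n - 2) / sqrt(1 - r^2) with n - 2 degrees of freedom, to the reported sulphates and alcohol correlation. It assumes independent samples and the standard test conditions, and `cor_t_test` is a name invented for this example.

```r
# t statistic and two-sided p-value for H0: true correlation is zero.
# `cor_t_test` is a hypothetical helper using the standard Pearson t-test.
cor_t_test <- function(r, n) {
  t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)
  p_val  <- 2 * pt(-abs(t_stat), df = n - 2)
  c(t = t_stat, p = p_val)
}

# Reported correlation between sulphates and alcohol, with n = 1599 wines
res <- cor_t_test(r = 0.09359475, n = 1599)
print(res)

# Share of variance in one variable explained by the other (under 1 %)
shared_variance <- 0.09359475^2
```

So the relationship is weak in practical terms, since it explains under 1 % of the variance, even though the test would reject a correlation of exactly zero.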
## [1] 0.2056325
All good quality red wines have a pH value between 3.0 and 3.6 and an alcohol content greater than 10 %. From the plot above we can see a weak correlation of 0.21 between alcohol and pH value. This means that as the alcohol content increases, the pH tends to rise slightly, i.e. the acidity of the red wine decreases.
## [1] 0.6680473
This plot shows a strong positive linear correlation between density and fixed acidity. It means that as the fixed acidity of the red wine increases, its density also increases. But we saw earlier that good quality wines have lower density and higher fixed acidity. So there must be other variables which reduce the density of the red wine. Let's find out.
## [1] -0.4961798
## [1] -0.3416993
As we suspected, the density of the red wine falls as the alcohol content rises, as the two have a fairly strong negative correlation of -0.496. We see here that density also falls as the pH value rises (r = -0.342). This makes sense, because we saw above that an increase in alcohol content goes with an increase in pH value. It also means that more acidic wines have higher density, since a lower pH value means higher acidity.
Now that every variable makes sense, let's create our predictive model for the quality of the wine.
##
## Calls:
## m1: lm(formula = quality ~ alcohol, data = redWine)
## m2: lm(formula = quality ~ alcohol + sulphates, data = redWine)
## m3: lm(formula = quality ~ alcohol + sulphates + citric.acid, data = redWine)
## m4: lm(formula = quality ~ alcohol + sulphates + citric.acid + fixed.acidity,
## data = redWine)
## m5: lm(formula = quality ~ alcohol + sulphates + citric.acid + fixed.acidity +
## volatile.acidity, data = redWine)
## m6: lm(formula = quality ~ alcohol + sulphates + citric.acid + fixed.acidity +
## volatile.acidity + density, data = redWine)
## m7: lm(formula = quality ~ alcohol + sulphates + citric.acid + fixed.acidity +
## volatile.acidity + density + pH, data = redWine)
##
## ======================================================================================================================
## m1 m2 m3 m4 m5 m6 m7
## ----------------------------------------------------------------------------------------------------------------------
## (Intercept) 1.875*** 1.375*** 1.434*** 1.138*** 2.202*** 30.401* 23.899
## (0.175) (0.177) (0.176) (0.214) (0.224) (15.163) (16.956)
## alcohol 0.361*** 0.346*** 0.338*** 0.346*** 0.320*** 0.298*** 0.308***
## (0.017) (0.016) (0.016) (0.016) (0.016) (0.020) (0.023)
## sulphates 0.994*** 0.814*** 0.821*** 0.701*** 0.732*** 0.716***
## (0.102) (0.107) (0.106) (0.103) (0.104) (0.105)
## citric.acid 0.513*** 0.312* -0.469*** -0.460*** -0.482***
## (0.093) (0.125) (0.137) (0.137) (0.139)
## fixed.acidity 0.033* 0.057*** 0.077*** 0.065**
## (0.013) (0.013) (0.017) (0.022)
## volatile.acidity -1.343*** -1.302*** -1.308***
## (0.113) (0.116) (0.116)
## density -28.268 -21.233
## (15.198) (17.275)
## pH -0.149
## (0.174)
## ----------------------------------------------------------------------------------------------------------------------
## R-squared 0.227 0.270 0.284 0.286 0.344 0.345 0.346
## adj. R-squared 0.226 0.269 0.282 0.284 0.342 0.343 0.343
## sigma 0.710 0.690 0.684 0.683 0.655 0.655 0.655
## F 468.267 294.988 210.501 159.804 167.023 139.977 120.065
## p 0.000 0.000 0.000 0.000 0.000 0.000 0.000
## Log-likelihood -1721.057 -1675.142 -1659.955 -1657.046 -1589.648 -1587.913 -1587.544
## Deviance 805.870 760.894 746.576 743.865 683.728 682.245 681.930
## AIC 3448.114 3358.284 3329.910 3326.091 3193.297 3191.826 3193.088
## BIC 3464.245 3379.793 3356.795 3358.354 3230.937 3234.843 3241.482
## N 1599 1599 1599 1599 1599 1599 1599
## ======================================================================================================================
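library(ggfortify)  # assumed source of autoplot() for lm objects
autoplot(m7)        # residual diagnostic plots for model m7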
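One way to read the comparison table is to ask which model the information criteria prefer, since lower AIC or BIC is better. The sketch below copies the reported adjusted R^2, AIC and BIC values into a data frame (named `fits` here, a hypothetical name) and picks the minimum of each criterion.

```r
# Fit statistics copied from the model comparison table above.
# `fits` is a hypothetical name used only for this illustration.
fits <- data.frame(
  model  = paste0("m", 1:7),
  adj_r2 = c(0.226, 0.269, 0.282, 0.284, 0.342, 0.343, 0.343),
  AIC    = c(3448.114, 3358.284, 3329.910, 3326.091,
             3193.297, 3191.826, 3193.088),
  BIC    = c(3464.245, 3379.793, 3356.795, 3358.354,
             3230.937, 3234.843, 3241.482),
  stringsAsFactors = FALSE
)

# Lower is better for both criteria; BIC penalizes extra terms more heavily
best_aic <- fits$model[which.min(fits$AIC)]
best_bic <- fits$model[which.min(fits$BIC)]
print(c(AIC = best_aic, BIC = best_bic))

# Gain in adjusted R^2 from adding density and pH on top of m5
gain_after_m5 <- fits$adj_r2[7] - fits$adj_r2[5]
```

On these numbers AIC favors m6 by a small margin and BIC favors m5, while adjusted R^2 barely moves after m5, so adding density and pH contributes little.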
The main variable affecting the quality of the red wine is its alcohol content. Apart from alcohol, higher amounts of sulphates, fixed acids and citric acid are associated with better quality of the red wine. Note, however, that the citric acid coefficient turns negative in models m5 to m7, once volatile acidity is included.
Volatile acidity, density and pH value affect the quality in a negative direction. The more volatile acid is present in the red wine, the lower its quality. Similarly, high quality red wines have lower density than low quality red wines.
The pH value shows how acidic a liquid is. In this dataset of red wines, we found both good and bad quality among the less acidic as well as the more acidic wines. So the pH value does not affect red wine quality significantly, but most of the good quality red wines are neither strongly nor weakly acidic: the majority have a pH value between 3 and 3.6.
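The weak role of pH can also be read off the m7 column of the model table by dividing each estimate by its standard error. The sketch below does this with the rounded values printed there and flags terms whose absolute t value is below 1.96, the usual large-sample cutoff for the 5 % level. The name `m7_coefs` is hypothetical, and the cutoff is an approximation rather than an exact test.

```r
# Estimates and standard errors of m7, copied from the model table.
# `m7_coefs` is a hypothetical name; values are rounded as printed.
m7_coefs <- data.frame(
  term     = c("alcohol", "sulphates", "citric.acid", "fixed.acidity",
               "volatile.acidity", "density", "pH"),
  estimate = c(0.308, 0.716, -0.482, 0.065, -1.308, -21.233, -0.149),
  se       = c(0.023, 0.105, 0.139, 0.022, 0.116, 17.275, 0.174),
  stringsAsFactors = FALSE
)

# t value = estimate / standard error
m7_coefs$t <- m7_coefs$estimate / m7_coefs$se
print(m7_coefs)

# Terms not distinguishable from zero at roughly the 5 % level
weak_terms <- m7_coefs$term[abs(m7_coefs$t) < 1.96]
print(weak_terms)
```

Only density and pH fall below the cutoff, which matches the table showing no significance stars for them in m7.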
I saw a surprising correlation between density and fixed acidity. Good quality wines have lower density but higher fixed acidity. Yet an increase in fixed acidity leads to an increase in density, even though fixed acidity is positively related to the quality of the red wine. So I suspected that there must be some other variables which reduce this increased density, and I found that those variables are alcohol content and pH value. Here I came to see how interrelated these different variables are.
I created a linear model starting from quality and alcohol content, as alcohol content has a strong correlation with quality.
Then I added the other moderately correlated variable, sulphate content, and finally the variables with small correlations: citric acid, fixed acidity, density, pH value and volatile acidity.
The variables in the linear model account for around 35 % of the variance in red wine quality.
The quality of the red wine is the main output variable in the dataset. We can see that a large number of red wines have a moderate quality of 5 or 6, while there are some low quality red wines with a quality of 3 or 4. There are some good quality red wines with a quality of 7, but very few have the best quality of 8. These best quality red wines number fewer than 50 out of 1599.
Red wines of good quality have a higher alcohol content than those of low quality. Alcohol content is strongly correlated with the quality of the red wine. We can see that the median alcohol content for quality 8 red wines is around 12.2 %, while for quality 3 red wines it is 9.9 %.
From the plot above we can clearly see the variables affecting the quality of the red wine. High quality red wines have more citric acid and alcohol and less volatile acidity, whereas low quality red wines have less citric acid and alcohol and more volatile acidity. From this plot we get an idea of which variables to choose for our linear predictive model.
The red wines data set contains information on almost 1600 red wines across twelve variables. I started by understanding the individual variables in the data set, and then I explored interesting questions as I continued to make observations on plots. Eventually, I explored the quality of the red wines across many variables and created a linear model to predict red wine quality.
There was a clear trend between the quality and the alcohol content of the red wines. Some other variables, like citric acid (which gives taste and freshness to the red wine), fixed acidity and sulphates, had a positive correlation with quality. Other variables, like density, pH and volatile acidity, had a negative correlation with quality.
I was surprised to see that salts and residual sugar content do not play an important role in deciding the quality of the red wine.
The model I created here is a linear model which accounts for around 35 % of the variance in red wine quality. This 35 % is much lower than I expected, so it is not a good model for predicting the quality of the red wine. There are models other than the linear model which could be used to build a new and better predictive model.